We have learned that Convolutional Neural Networks are excellent at classifying images. They process raw data through multiple layers to extract features and discover the hierarchical representations needed for different kinds of tasks (LeCun et al., 2015). But how do Neural Networks build up their understanding of input images? What kinds of patterns are captured in each layer? Feature visualization helps us learn more about these inner working mechanisms.
In this project, we recreate some of the techniques described in the papers we have read, and project the features from different CNN layers backward to pixel space.
After implementing our visualization methods on the MNIST (28x28) and CIFAR-10 (32x32) datasets, we chose a higher-resolution dataset, STL-10 (96x96), as our main dataset for further study.
For the STL-10 dataset, we visualized maximal activation maps from the first and second convolutional layers, and visualized saliency maps by taking gradients of the logit values with respect to the input images, as well as gradients of tensor values from other higher layers. We also visualized the last hidden layer (the 192-dim features from the trained model) using a dimensionality reduction method (t-SNE) and by comparing nearest neighbors in feature space and pixel space. Finally, we used activation maximization to generate images that visualize features, applying five different regularization methods to improve recognizability. To confirm that our implementation was correct, we tested the output scores of images from each class, and also applied the method to the MNIST dataset and compared the performance.
Notes:
Datasets: STL-10, CIFAR-10, MNIST
The main part of this report contains our implementations and findings on the STL-10 dataset.
The appendix includes our analysis of the MNIST and CIFAR-10 datasets.
import os,sys,os.path
import numpy as np
import seaborn as sns
sns.reset_orig()
import pandas as pd
from scipy.spatial import distance
from sklearn.manifold import TSNE
import tensorflow as tf
import tensorflow.contrib.slim as slim
import matplotlib.pyplot as plt
%matplotlib inline
from tensorflow.examples.tutorials.mnist import input_data
import math
import pickle
from IPython.display import Image
The STL-10 dataset is inspired by the CIFAR-10 dataset but with some modifications.
Note: Detailed functions related to this dataset are put in STL10.py.
from STL10 import *
First, we extracted pixel values, labels, and the corresponding one-hot encoded labels from the 13000 raw labeled images.
images, labels, raw_labels= read_all_images('./data/stl10_binary/')
Second, we need to define and train a convolutional network. We built a network that is fairly standard: two convolutional layers, each followed by a ReLU gate and 3x3 max pooling, then three fully connected layers and a softmax classifier. The last fully connected layer outputs 10 class scores for each input image.
We have already trained this network for 2001 steps, reaching roughly 93% accuracy. In the following feature-visualization sections, we can therefore directly load the saved model, which contains the best weight and bias values.
############### 7-Layer CNN ##############
######### AlexNet (some changes) #########
################ STL-10 ################
n1 = 96
n2 = 96
n3 = 384
n4 = 192
batch_size=100
x = tf.placeholder(tf.float32, [None, 96, 96, 3], name='x')
y = tf.placeholder(tf.float32, [None, 10], name='y')
pkeep = tf.placeholder(tf.float32, name='pkeep')
# CNN_layer1
W_conv1 = tf.get_variable('W_conv1', shape=[5, 5, 3, n1])
b_conv1 = tf.get_variable('b_conv1', shape=[n1])
h_conv1 = tf.nn.relu(tf.add(conv2d(x, W_conv1), b_conv1))
# Pool_layer1
h_pool1 = max_pool33(h_conv1)
# CNN_layer2
W_conv2 = tf.get_variable('W_conv2', shape=[3, 3, n1, n2])
b_conv2 = tf.get_variable('b_conv2', shape=[n2])
h_conv2 = tf.nn.relu(tf.add(conv2d(h_pool1, W_conv2), b_conv2))
# Pool_layer2
h_pool2 = max_pool33(h_conv2)
# FC_layer1
h_pool2_flat = tf.reshape(h_pool2, [-1, 24*24*n2])
W_fc1 = tf.get_variable('W_fc1', shape=[24*24*n2, n3])
b_fc1 = tf.get_variable('b_fc1', shape=[n3])
h_fc1 = tf.nn.relu(tf.add(tf.matmul(h_pool2_flat, W_fc1), b_fc1))
# FC_layer2
W_fc2 = tf.get_variable('W_fc2', shape=[n3, n4])
b_fc2 = tf.get_variable('b_fc2', shape=[n4])
h_fc2 = tf.nn.relu(tf.add(tf.matmul(h_fc1, W_fc2), b_fc2))
# FC_layer3
W_fc3 = tf.get_variable('W_fc3', shape=[n4, 10])
b_fc3 = tf.get_variable('b_fc3', shape=[10])
logits = tf.add(tf.matmul(h_fc2, W_fc3), b_fc3)
# loss = compute_cross_entropy(logits=logits, y=y)
# accuracy = compute_accuracy(logits, y)
# train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
# Validation set
batch_all = random_batch(data_images=images[10000:, :, :, :],
                         data_labels=labels[10000:, :],
                         batch_size=100)
valid_img = batch_all[0]
valid_lab = batch_all[1]
# train_CNN_stl10(input_images=images, input_labels=labels, batch_size=100, num_step=2001)
tf.reset_default_graph()
with tf.Session() as sess:
    # Load saved model
    new_saver = tf.train.import_meta_graph("./output/stl10/alex_on_stl10-2000.meta")
    new_saver.restore(sess, tf.train.latest_checkpoint('./output/stl10/'))
In this part we are trying to visualize the maximal activation maps for different input images in each convolutional layer.
We defined 96 filters (and thus 96 activation maps) for each convolutional layer. The intuition behind displaying these features is straightforward: for each input image, we retrieved all activation maps, found the maximal one, set all other activation maps to 0 (including those for other input images), and then passed the result backward through the network until reaching the pixel layer.
The most challenging part is projecting the features we found back to the pixel layer. Several papers introduce a structure called the deconvolutional network ('deconvnet'). Given a high-level feature map, the 'deconvnet' inverts the data flow of a CNN, going from neuron activations in the given layer down to an image; the resulting reconstruction shows the part of the input image that most strongly activates the neuron. A detailed explanation appears in Striving for Simplicity: The All Convolutional Net (arXiv:1412.6806 [cs.LG]).
In practice, this can be achieved with a regular convolutional layer whose filter is transposed. The max-pooling layer, however, is hard to reconstruct because it is not invertible. In Striving for Simplicity: The All Convolutional Net (arXiv:1412.6806 [cs.LG]), the authors record the positions of the maxima within each pooling region during the forward pass and reuse these positions in the 'deconvnet' to obtain the reconstruction. In this project, we used the simplest method: we reconstructed the layer by setting the whole 3x3 square equal to the maximum within that pooling region. This pooling approximation undoubtedly influences the outcomes, especially for reconstructions from the second convolutional layer.
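The simplified pooling reconstruction described above can be sketched in numpy. This is a minimal illustration (the function name is ours, and it assumes non-overlapping 3x3 pooling regions for simplicity): each pooled maximum is broadcast back over its entire 3x3 region.

```python
import numpy as np

def naive_unpool(pooled, pool_size=3):
    """Approximate inverse of max pooling: broadcast each pooled
    maximum back over its whole pool_size x pool_size region."""
    h, w = pooled.shape
    out = np.zeros((h * pool_size, w * pool_size), dtype=pooled.dtype)
    for i in range(h):
        for j in range(w):
            out[i * pool_size:(i + 1) * pool_size,
                j * pool_size:(j + 1) * pool_size] = pooled[i, j]
    return out

pooled = np.array([[4.0, 2.0],
                   [1.0, 3.0]])
up = naive_unpool(pooled)
print(up.shape)  # (6, 6)
```

The same effect can be had with `np.kron(pooled, np.ones((3, 3)))`. Unlike switch-based unpooling, this loses the location of the maximum within each region, which is one reason the second-layer reconstructions come out blurrier.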
display_max_activations_CONV1_stl10(input_images=images, input_labels=labels, batch_size=100, n1=96)
From the plots above we can see that the first convolutional layer has carefully learned some patterns from the input images, for example every possible edge and line in the raw images. The outlines of the main objects were reconstructed best when they could be easily distinguished from the background, as for the car, bird, airplane, and ship.
Activation map No. 29 appears for almost all the images, which suggests that this filter activates maximally for most of the given pictures, capturing not only the blue color but all possible edges and lines in the raw images.
display_max_activations_CONV2_stl10(input_images=images, input_labels=labels, batch_size=100, n1=96, n2=96)
From the plots above we can see that the second convolutional layer tries to learn, or distinguish, the outlines of the main objects from the backgrounds via distinct color patterns.
Intuitively, the more prominent the objects are, the better the reconstruction. For example, airplanes and birds usually appear against a blue background and dogs and horses against a green one, while the objects themselves have distinct colors such as yellow or black that differ strongly from the blue or green. As a result, reconstructions of those images from the second convolutional layer were quite good.
In this part, we are trying to do saliency visualizations for given images.
As introduced in the paper Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps (arXiv:1312.6034 [cs.CV]), we first obtained the logit values (the unnormalized class scores) for each input image from the last fully connected layer of our saved model, and then used back-propagation to compute the gradients of these class scores with respect to the pixel values of the input images. To derive a saliency value for each pixel (i, j) of each image, we took the maximum magnitude of the absolute gradients across the 3 color channels and 10 classes.
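The reduction from raw gradients to a single saliency map can be sketched in numpy (a simplified illustration with a stand-in gradient array; Grad_to_images_stl10 performs the actual TensorFlow gradient computation):

```python
import numpy as np

def saliency_from_grads(grads):
    """grads: gradients of the class scores w.r.t. one input image,
    shape (num_classes, H, W, C). Returns an (H, W) saliency map:
    the maximum absolute gradient over classes and color channels."""
    return np.abs(grads).max(axis=(0, 3))

grads = np.random.randn(10, 96, 96, 3)   # stand-in for real gradients
sal = saliency_from_grads(grads)
print(sal.shape)  # (96, 96)
```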
Grad_to_images_stl10(layer_name="Add_4:0", input_images=images, input_labels=labels, batch_size=100)
Image("./img/saliency1.png", width = 1200, height = 1200)
From the plots above we can see that the outlines of the main objects are not very clear but can still be made out. These saliency maps tell us, to some extent, which pixels matter for classification.
In this part, we are trying to do back-propagation visualization for given images from a specific layer in CNN.
Instead of taking the logit values from the last fully connected layer as before, we took the tensor for specific input images from other higher layers, such as the second convolutional layer of our CNN, and then used back-propagation to compute the gradients of these tensor values with respect to the pixel values of the input images. We then took the maximum magnitude of the absolute gradients across the 3 color channels and all activation maps of that layer.
Grad_to_images_stl10(layer_name="MaxPool_1:0", input_images=images, input_labels=labels, batch_size=100)
Image("./img/back-prop1.png", width = 1200, height = 1200)
Image("./img/back-prop4.png", width = 1200, height = 1200)
From the plots above we can see that the outlines of the main objects are much clearer than in the saliency maps. The eyes and noses of the dogs and the legs of the deer are all easy to make out. These back-propagation maps show, to some extent, which pixels matter for the second max-pooling layer in our CNN.
Grad_to_images_stl10(layer_name="Relu_1:0", input_images=images, input_labels=labels, batch_size=100)
Image("./img/back-prop2.png", width = 1200, height = 1200)
Image("./img/back-prop3.png", width = 1200, height = 1200)
From the plots above we can see that the outlines of the main objects are much clearer than in the saliency maps, or than the ones back-propagated from the second max-pooling layer. The eyes of the dogs and monkeys, the legs of the birds, and the wings of the airplane are all easy to distinguish. These back-propagation maps show, to some extent, which pixels matter for the second convolutional layer in our CNN.
In this part, we are trying to visualize activation maps for specific layers with specific images.
We first obtained the tensor values from the first convolutional layer for some input images, and then displayed all the activation maps (96 in total) as 96x96 grayscale images.
plotCNN_actmaps_stl10("Relu:0", input_images=valid_img, image_idx=30)
From the plots above we can see that almost all the activation maps from the first convolutional layer try to capture the general outline of the main object in an input image. Most of them are clear enough to make out the object itself.
We then obtained the tensor values from the second convolutional layer for the same input images, and displayed all of its activation maps (96 in total) as grayscale images.
plotCNN_actmaps_stl10("Relu_1:0", input_images=valid_img, image_idx=30)
From the plots above we can see that, in the activation maps from the second convolutional layer, the general outline of the object is no longer clear. Importantly, though, the position of the object is still captured in most of the activation maps from this layer.
The idea of these two ways to visualize the last hidden layer comes from Stanford CS231n, Lecture 12.
In the last layer, we have 10 class scores, one per image class. Immediately before it is another fully connected layer, FC_layer2, whose output tensor h_fc2 represents the 192-dimensional features of the images that are fed into the last layer to predict the class scores.
Here we want to visualize and understand our convolutional neural network by examining these 192-dimensional features. First we run some images through our trained network to get their 192-dimensional features, and then we use two visualization methods:
t-SNE
Nearest Neighbors
t-SNE is a dimensionality reduction method: it takes high-dimensional features and compresses them to a low dimension, such as two, so that we can visualize the features in two-dimensional coordinates.
In this part, we randomly selected 900 images, ran them through our ConvNet, and obtained 192-dim features for each image. We then plotted the reduced 2-dim t-SNE features in coordinates, where clusters of classes are visible. Finally, we placed the original images into one large merged image according to their 2-dim t-SNE coordinates.
Tool: the TSNE function from sklearn.manifold
Reference: t-SNE visualization of CNN codes, tsne-visualization
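A minimal sketch of the t-SNE step, using random vectors in place of the real 192-dim features (our tSNE_show wraps this together with the plotting):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the 192-dim CNN features of 60 images.
rng = np.random.RandomState(0)
features = rng.randn(60, 192)

# Compress to 2-D for plotting; perplexity must stay below the
# number of samples.
embedded = TSNE(n_components=2, perplexity=10,
                random_state=0).fit_transform(features)
print(embedded.shape)  # (60, 2)
```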
First we visualized all 10 classes of images.
The labels of classes and their corresponding class names: {0: 'plane', 1: 'bird', 2: 'car', 3: 'cat', 4: 'deer', 5: 'dog', 6: 'horse', 7: 'monkey', 8: 'ship', 9: 'truck'}
From the cluster plot, we can see a distinct boundary between vehicles (ship, airplane, car, etc.) and animals (monkey, cat, deer, etc.). For each specific class we can also see clusters, though some classes overlap.
In the merged image, we can likewise easily see the clusters of vehicles (car, plane, ship) and animals (with green backgrounds). With all 10 classes, the individual clusters are less apparent; we show 3 specific classes in the next part, where the visualization is clearer.
random_images,random_labels,idx = RandomImagesNeeded(images, labels, 900)
features = Extract192Features(random_images, random_labels)
tSNE_show(images,features, random_labels, idx)
We've run the above lines of code and saved the result, so here we just load the saved plot and image.
Image("./img/all_1_cluster.png", width = 300, height = 300)
Image("./img/all_1.png", width = 1200, height = 1200)
In this part, we focused on three specific classes: {0: airplane, 2: car, 7: monkey}, so that we could show the result more clearly.
From the cluster plot we could easily see three clusters of classes.
From the merged image, we can also see clear boundaries between the three classes: the top green part being monkeys, the middle blue part being airplanes, and the bottom part being cars.
random_images,random_labels,idx = RandomImagesNeeded(images, labels, 900, [0,2,7])
features = Extract192Features(random_images, random_labels)
tSNE_show(images,features, random_labels, idx)
We've run the above lines of code and saved the result, so here we just load the saved plot and image.
Image("./img/023_3_cluster.png", width = 300, height = 300)
Image("./img/023_3.png", width = 1200, height = 1200)
In this part, for a given test image, we want to find its nearest neighbors (the most similar images) in the whole image dataset. By nearest neighbors we mean the images minimizing the Euclidean (l2) distance between the test image's feature vector and those of all other images. There are two ways to compute the l2 distance:
(1) put aside our ConvNet model, calculate l2 distance based on pixel space of images
(2) use 192-dim vector from last hidden layer as feature vector, calculate l2 distance based on 192-dim features
We will compare these two results and show that 192-dim features from our ConvNet extract meaningful semantic content information.
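Both variants reduce to the same l2 nearest-neighbor search; only the vectors differ (raw pixels vs. 192-dim features). A minimal numpy sketch (the helper name is ours; FindNN is the project's actual implementation):

```python
import numpy as np

def l2_nearest_neighbors(query_vec, all_vecs, k=7):
    """Indices of the k vectors in all_vecs closest to query_vec
    under Euclidean (l2) distance."""
    dists = np.linalg.norm(all_vecs - query_vec, axis=1)
    return np.argsort(dists)[:k]

# Works identically on 96*96*3-dim pixel vectors or 192-dim features.
rng = np.random.RandomState(0)
feats = rng.randn(100, 192)
idx = l2_nearest_neighbors(feats[0], feats, k=7)
print(idx[0])  # 0 -- a vector's nearest neighbor is itself
```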
Below we show nearest neighbors of test images based on pixel space. Each row shows one test image and its 7 nearest neighbors.
For example, the first test image is a bird, but its nearest neighbors shown here are ships and planes rather than birds. This method only compares raw pixels at corresponding locations and does not capture the main content of the images. Since a large part of the bird image is sky, we get many images with sky backgrounds, like planes and ships.
images_reshape = images.reshape(13000, 96*96*3)
FindNN([223, 955, 6666, 9550, 138, 777, 8888, 2222, 222, 99], images, images_reshape, raw_labels, 8)
Below we show nearest neighbors of test images based on 192-dim features from last hidden layer. Each row shows one test image and its 7 nearest neighbors.
We can see that the result below is quite good. For each test image, the nearest neighbors are mostly the same kind of object. The result is quite different from the pixel-space nearest neighbors: here the pixels of the test image and its neighbors are not very similar, but their semantic contents are, which shows that the model captures the important information in the image.
For example, in the second row the test image is a deer facing right, while the deer in the 1st, 2nd, and 4th nearest images face left. The pixels at corresponding locations can be quite different since the orientations are opposite, but in the feature space learned by the ConvNet they are close to each other, which shows that the ConvNet actually captures the semantic content of the image.
#all_features = Extract192Features(images,labels)
#np.save('./output/all_features',all_features)
all_features = np.load('./output/all_features.npy')
FindNN([223, 955, 6666, 9550, 138, 777, 8888, 2222, 222, 99],images, all_features,raw_labels,8)
In a CNN, each conv layer has several learned template-matching filters that maximize their output when a similar pattern is found in the input image. The general idea of activation maximization is to generate an input image that maximizes a filter's output activation. By iteratively following the gradient to reduce this loss (i.e., increase the activation), we can understand what sort of input patterns activate a particular filter.
To achieve this, we first define a loss function to compute the activation maximization loss. We randomly initialize an image as the starting point of our gradient iteration process. In each iteration, we compute the gradient of the loss with respect to the input image, then use it to update the input so as to reduce the activation maximization loss.
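The update loop can be illustrated with a toy, fully differentiable stand-in for a filter activation (a linear score w·x), which makes the gradient trivial; in the real code below, the loss is a class logit of the Keras model:

```python
import numpy as np

# Toy stand-in for a filter activation: a(x) = w . x, so da/dx = w.
rng = np.random.RandomState(0)
w = rng.randn(96 * 96 * 3)

x = rng.randn(96 * 96 * 3) * 0.1   # randomly initialized "image"
step = 1.0
for _ in range(100):
    grad = w                       # gradient of the activation w.r.t. x
    x = x + step * grad            # ascend: increase the activation

print(w @ x > 0)  # True -- the activation has grown large and positive
```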
# The network here is the same as the one we used in TensorFlow
from activation_max import *
from keras.layers import Conv2D, MaxPooling2D, Dropout, Dense, Flatten, Activation, Input
from keras.models import Sequential
from keras import backend as K
IMG_WIDTH, IMG_HEIGHT = 96, 96
weights_path = './output/keras_model_weights.h5'
model = get_model(weights_path)
import matplotlib.pyplot as plt
if K.image_data_format() == 'channels_first':
    input_shape = (3, IMG_WIDTH, IMG_HEIGHT)
elif K.image_data_format() == 'channels_last':
    input_shape = (IMG_WIDTH, IMG_HEIGHT, 3)
#init_img = np.random.random(input_shape) * 20 + 128.0
init_img = np.zeros(input_shape)
img = deprocess_image(init_img)
class_idx = 3
img = init_img[np.newaxis, :]
new_img = visualize_activation_basic(model, class_idx, img, n_iteration=100, verbose=False)
new_img = deprocess_image(new_img)
plt.imshow(new_img[0,:])
plt.title("Generated image of preliminary trial for class index = 3")
plt.show()
Two problems arise in our preliminary trial above. First, the image updates too slowly: we cannot reach a reasonably low loss within 500 iterations, even with a learning rate of 10. Second, the images we generated seem unrecognizable to humans. Chris Olah of the Google team describes this issue as the enemy of feature visualization (https://distill.pub/2017/feature-visualization/): the process often ends in a kind of neural network optical illusion, an image full of noise and nonsensical high-frequency patterns to which the network responds strongly.
To solve these problems, we explored several regularization methods and combined them to enhance the effect. Generally speaking, there are two major ideas. One is to apply some modification to the image after each optimization step so that the algorithm tends toward nicer images. The other is to modify the loss so that the learning process favors more natural images over unnatural ones.
8.2.1 Decay
This regularization method refers to the paper Understanding Neural Networks Through Deep Visualization (https://arxiv.org/abs/1506.06579). A simple regularization is to make the image closer to the mean at each step. It avoids bright pixels with very high values by penalizing large values. Decay tends to prevent a small number of extreme pixel values from dominating the example image. Such extreme single-pixel values neither occur naturally with great frequency nor are useful for visualization.
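A sketch of what the decay_regularization helper used in our update loop might look like (our images are zero-centered, so shrinking toward zero is shrinking toward the mean; the actual implementation in activation_max.py may differ):

```python
import numpy as np

def decay_regularization(img, decay=0.9):
    """Shrink every pixel toward zero after each gradient step,
    penalizing extreme single-pixel values."""
    return img * decay

img = np.array([10.0, -200.0, 0.5])
print(decay_regularization(img)[0])  # 9.0
```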
8.2.2 Blur
This regularization method refers to the paper Understanding Neural Networks Through Deep Visualization (https://arxiv.org/abs/1506.06579). Producing images via gradient ascent can arrive at high activations, but they are neither realistic nor interpretable. A useful regularization is thus to penalize high frequency information to make the image smoother.
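A sketch of a blur_regularization helper as a simple box blur (the project's version takes a size tuple and may use a different kernel; this simplified one takes a single integer size):

```python
import numpy as np

def blur_regularization(img, size=3):
    """Box blur: average each pixel over a size x size neighborhood,
    damping high-frequency noise in the generated image."""
    pad = size // 2
    padded = np.pad(img, pad, mode='edge')
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for di in range(size):
        for dj in range(size):
            out += padded[di:di + h, dj:dj + w]
    return out / (size * size)

rng = np.random.RandomState(0)
noisy = rng.randn(8, 8)
smooth = blur_regularization(noisy)
print(smooth.std() < noisy.std())  # True -- blurring removes high frequencies
```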
8.2.3 Clipping pixels with small norm
This regularization method refers to Understanding Neural Networks Through Deep Visualization (https://arxiv.org/abs/1506.06579). Even after applying the two regularizations above, while some pixels show the primary object, the gradients with respect to all other pixels are still generally non-zero, so those pixels also shift to show some pattern. Our goal in this step is to clip pixels with small norms so as to remove their effect on maximizing the activation.
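A simplified sketch of the clipping step (the project's clip_weak_pixel_regularization also receives the gradients; here we clip on pixel magnitude alone, with a hypothetical percentile threshold):

```python
import numpy as np

def clip_weak_pixels(img, percentile=50):
    """Zero out pixels whose absolute value falls below the given
    percentile, removing their small but nonzero contribution."""
    threshold = np.percentile(np.abs(img), percentile)
    return np.where(np.abs(img) < threshold, 0.0, img)

img = np.array([0.1, -0.2, 5.0, -4.0])
print(clip_weak_pixels(img))  # [ 0.  0.  5. -4.]
```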
8.2.4 Bounded variation
This regularization method refers to Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images (Aravindh Mahendran and Andrea Vedaldi, 2016). It uses the total variation (TV) of the image, encouraging reconstructions that consist of piecewise-constant patches. For a discrete image x, the TV norm is approximated using finite differences as follows:
$R_{TV^{\beta}}(x)=\frac{1}{HWV^{\beta}}\sum_{u,v,k}\left(\left(x(v,u+1,k)-x(v,u,k)\right)^2+\left(x(v+1,u,k)-x(v,u,k)\right)^2\right)^{\frac{\beta}{2}}$
where $\beta$ = 1. Here the constant V in the normalization coefficient is the typical value of the norm of the gradient in the image.
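The TV norm above can be computed directly with finite differences. A numpy sketch, omitting the $1/(HWV^{\beta})$ normalization constant:

```python
import numpy as np

def tv_norm(x, beta=1.0):
    """Total variation of an H x W x K image via finite differences,
    without the 1/(H*W*V**beta) normalization factor."""
    du = x[:, 1:, :] - x[:, :-1, :]    # horizontal neighbor differences
    dv = x[1:, :, :] - x[:-1, :, :]    # vertical neighbor differences
    # Combine where both differences exist and apply the beta/2 power.
    d2 = du[:-1, :, :] ** 2 + dv[:, :-1, :] ** 2
    return np.sum(d2 ** (beta / 2.0))

flat = np.ones((5, 5, 3))
print(tv_norm(flat))  # 0.0 -- a constant image has zero total variation
```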
8.2.5 Bounded Range
This regularization method refers to Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images (Aravindh Mahendran and Andrea Vedaldi, 2016). It encourages pixel intensities to stay bounded. According to the paper, in activation maximization this is even more important for networks without normalization layers, since in that case scaling up the image range scales up the neural activations by the same amount.
def gradient_ascent_iter(loss_fn, img, step=1.0, verbose=False):
    loss_value, grads_value = loss_fn([img])
    if verbose: print("Loss: {}".format(loss_value))
    gradient_ascent_step = img + grads_value * step
    grads_row_major = np.transpose(grads_value[0, :], (1, 2, 0))
    img_row_major = np.transpose(gradient_ascent_step[0, :], (1, 2, 0))
    img_row_major = blur_regularization(img_row_major, size=(3,3))
    img_row_major = decay_regularization(img_row_major, decay=0.9)
    img_row_major = clip_weak_pixel_regularization(img_row_major, grads_row_major)
    img = np.float32([np.transpose(img_row_major, (2, 0, 1))])
    return img
def visualize_activation(model, class_idx, init_img, n_iteration=10, verbose=True,
                         tv_weight=0.1, lp_weight=0.1):
    model.layers[-1].activation = activations.linear
    layer_output = model.layers[-1].output
    input_tensor = model.input
    activation_loss = layer_output[:, class_idx]
    tv_regularizer = total_variation_regularizer(input_tensor)
    norm_regularizer = lp_regularizer(input_tensor, p=2)
    total_loss = activation_loss #- lp_weight * norm_regularizer - tv_weight * tv_regularizer
    grads = K.gradients(total_loss, input_tensor)[0]
    iterate = K.function([input_tensor], [total_loss, grads])
    img = init_img
    for i in range(n_iteration):
        if verbose: print("iteration: {0}".format(i))
        img = gradient_ascent_iter(iterate, img, verbose=verbose, step=1)
    return img
g_images = []
fig_act = plt.figure(figsize=(36,15))
for class_idx in range(10):
    img = init_img[np.newaxis, :]
    new_img = visualize_activation(model, class_idx, img, n_iteration=300, verbose=False)
    g_images.append(deprocess_image(new_img))
    plt.subplot(2,5,class_idx+1)
    plt.imshow(new_img[0,:])
    plt.title("class %d" % (int(class_idx)+1), fontsize=30)
plt.show()
Unfortunately, while the generated images are much better than those from our preliminary trial, they still do not look natural. This may raise doubt about whether we implemented activation maximization correctly. We therefore did further work to check whether our code can successfully generate an image via activation maximization and achieve our goal of feature visualization. It turns out that our code does work.
8.3.1 High Confidence Predictions for Unrecognizable Images
We firstly check the score output when we put these images into our model.
for class_idx in range(10):
    scores = model.predict(g_images[class_idx])
    print("score distribution when we set class as %d" % class_idx)
    print(scores)
As we can see, these images all have a very high confidence rate in their respective classes. In other words, images score very high for a single class even if they are unrecognizable by humans.
8.3.2 Output on MNIST looks much better
We also ran our code on the classic MNIST dataset. Its output is below; to keep this report concise, we include only one example.
from IPython.display import Image
Image("3.png",retina = True)
The different performance of our method on different models suggests that its scope may be limited.
We used activation maximization to generate images for visualizing features. After a preliminary trial, we explored and combined five regularization methods to improve recognizability. Given that the images were still hard for human beings to recognize, we investigated further.
To confirm that we implemented the method correctly, we tested the output scores of images from each class, and also applied the method to the MNIST dataset and compared the performance. Both checks proved that our method can successfully generate an image for a given class that achieves a high score for that class. Moreover, its relatively good performance on MNIST suggests that the method may fit some datasets better than others; we may explore this issue in the future.
The MNIST dataset is widely used as a starting point for deep learning problems. It contains black-and-white images of handwritten digits at 28 x 28 pixels, with 60,000 training images and 10,000 test images.
Note: Detailed functions related to this dataset are put in MNIST.py.
from MNIST import *
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('../data/mnist', one_hot=True)
First, we built a fairly standard network, as we did in class: two convolutional layers with 5x5 filters, each followed by a ReLU gate and 2x2 max pooling, then one fully connected layer, one dropout layer, one more fully connected layer, and a softmax classifier. The last fully connected layer outputs 10 scores for each input image.
We have already trained this network for 1001 steps, reaching roughly 98% accuracy. In the following feature-visualization sections, we can directly load the saved model, which contains the best weight and bias values.
############# 7-Layer CNN #############
############# cpcpfdf #############
#############  MNIST  #############
n1 = 32
n2 = 64
n3 = 1024
x = tf.placeholder(tf.float32, [None, 784], name='x')
y = tf.placeholder(tf.float32, [None, 10], name='y')
pkeep = tf.placeholder(tf.float32, name='pkeep')
x_image = tf.reshape(x, [-1,28,28,1])
# CNN_layer1
W_conv1 = tf.get_variable('W_conv1', shape=[5, 5, 1, n1])
b_conv1 = tf.get_variable('b_conv1', shape=[n1])
h_conv1 = tf.nn.relu(tf.add(conv2d(x_image, W_conv1), b_conv1))
# Pool_layer1
h_pool1 = max_pool22(h_conv1)
# CNN_layer2
W_conv2 = tf.get_variable('W_conv2', shape=[5, 5, n1, n2])
b_conv2 = tf.get_variable('b_conv2', shape=[n2])
h_conv2 = tf.nn.relu(tf.add(conv2d(h_pool1, W_conv2), b_conv2))
# Pool_layer2
h_pool2 = max_pool22(h_conv2)
# FC_layer1
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*n2])
W_fc1 = tf.get_variable('W_fc1', shape=[7*7*n2, n3])
b_fc1 = tf.get_variable('b_fc1', shape=[n3])
h_fc1 = tf.nn.relu(tf.add(tf.matmul(h_pool2_flat, W_fc1), b_fc1))
# Dropout_layer1
h_fc1_drop = tf.nn.dropout(h_fc1, pkeep)
# FC_layer2
W_fc2 = tf.get_variable('W_fc2', shape=[n3, 10])
b_fc2 = tf.get_variable('b_fc2', shape=[10])
logits = tf.add(tf.matmul(h_fc1_drop, W_fc2), b_fc2, name='CNN2_logits')
loss = compute_cross_entropy(logits=logits, y=y)
accuracy = compute_accuracy(logits, y)
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
#train_MNIST(mnist, train_step, x, y, pkeep, accuracy, 100, 1001)
In this part we visualized first convolutional layer. In our ConvNet it is tensor h_conv1 with shape [1, 28, 28, 32]. So we visualized it as 32 28x28 images.
We took an image of the digit 7 as input and ran it through our trained model to get h_conv1.
Since the digit image is simple, we can see a clear shape of the digit '7' in each activation map of the first convolutional layer.
plt.imshow(mnist.test.images[0].reshape(28,28))
tf.reset_default_graph()
VisualizeCNNLayer1(mnist.test.images[0])
In this part we visualized the second convolutional layer. In our ConvNet this is the tensor h_conv2, with shape [1, 14, 14, 64], so we visualized it as 64 14x14 images.
We can still see the shape of '7' in the second convolutional layer, but not as clearly as in the first, which shows that higher layers of a ConvNet are more difficult to interpret.
VisualizeCNNLayer2(mnist.test.images[0])
In this part we visualized the weights of the first convolutional layer. In our ConvNet this is the tensor W_conv1, with shape [5, 5, 1, 32], so we visualized it as 32 5x5 images.
The weights are more difficult to interpret; typically the weights of the first convolutional layer show something like oriented edges.
VisualizeWeightWconv1(mnist.test.images[0])
Note: We used cifar10.py and some other related packages from Hvass-Labs to import the dataset, which provide pixel values and one-hot encoded labels for all available images. All our detailed functions are in CIFAR10_all.py.
import cifar10
import dataset
from CIFAR10_all import *
cifar10.data_path = "./data/CIFAR-10/"
class_name = cifar10.load_class_names()
print(class_name)
train_images_cifar10, train_class_cifar10, train_labels_cifar10 = cifar10.load_training_data()
test_images_cifar10, test_class_cifar10, test_labels_cifar10 = cifar10.load_test_data()
print("Size for the training images: {0}".format(len(train_images_cifar10)))
print("Shape for the training images: {0}".format(train_images_cifar10.shape))
print("Shape for the training labels: {0}".format(train_labels_cifar10.shape))
print("Size for the test images: {0}".format(len(test_images_cifar10)))
print("Shape for the test images: {0}".format(test_images_cifar10.shape))
print("Shape for the test labels: {0}".format(test_labels_cifar10.shape))
First, we defined and trained a convolutional network, the same as for the STL-10 dataset. We built a network that is fairly standard: two convolutional layers, each followed by a ReLU gate and 3x3 max pooling, then three fully connected layers and a softmax classifier. The outputs of the last fully connected layer are 10 scores for each input image.
We had already trained this network for 30001 steps, reaching roughly 74% accuracy at the end. In the following main parts on feature visualization, we can therefore directly load the saved model, which contains the best weight and bias values.
############### 7-Layer CNN ##############
######## AlexNet (some changes) ##########
################ CIFAR10 ###############
n1 = 96
n2 = 96
n3 = 384
n4 = 192
x = tf.placeholder(tf.float32, [None, 32, 32, 3], name='x')
y = tf.placeholder(tf.float32, [None, 10], name='y')
# CNN_layer1
W_conv1 = tf.get_variable('W_conv1', shape=[5, 5, 3, n1])
b_conv1 = tf.get_variable('b_conv1', shape=[n1])
h_conv1 = tf.nn.relu(tf.add(conv2d(x, W_conv1), b_conv1))
# Pool_layer1
h_pool1 = max_pool33(h_conv1)
# CNN_layer2
W_conv2 = tf.get_variable('W_conv2', shape=[3, 3, n1, n2])
b_conv2 = tf.get_variable('b_conv2', shape=[n2])
h_conv2 = tf.nn.relu(tf.add(conv2d(h_pool1, W_conv2), b_conv2))
# Pool_layer2
h_pool2 = max_pool33(h_conv2)
# FC_layer1
h_pool2_flat = tf.reshape(h_pool2, [-1, 8*8*n2])
W_fc1 = tf.get_variable('W_fc1', shape=[8*8*n2, n3])
b_fc1 = tf.get_variable('b_fc1', shape=[n3])
h_fc1 = tf.nn.relu(tf.add(tf.matmul(h_pool2_flat, W_fc1), b_fc1))
# FC_layer2
W_fc2 = tf.get_variable('W_fc2', shape=[n3, n4])
b_fc2 = tf.get_variable('b_fc2', shape=[n4])
h_fc2 = tf.nn.relu(tf.add(tf.matmul(h_fc1, W_fc2), b_fc2))
# FC_layer3
W_fc3 = tf.get_variable('W_fc3', shape=[n4, 10])
b_fc3 = tf.get_variable('b_fc3', shape=[10])
logits = tf.add(tf.matmul(h_fc2, W_fc3), b_fc3)
#loss = compute_cross_entropy(logits=logits, y=y)
#accuracy = compute_accuracy(logits, y)
#train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
# Validation set
batch_all = random_batch(data_images=test_images_cifar10, data_labels=test_labels_cifar10, batch_size=100)
valid_img = batch_all[0]
valid_lab = batch_all[1]
#train_CNN_cifar10(input_images=train_images_cifar10, input_labels=train_labels_cifar10, train_step=train_step, batch_size=100, num_step=30001)
tf.reset_default_graph()
with tf.Session() as sess:
    # Load the saved model
    new_saver = tf.train.import_meta_graph("./output/cifar10/-30000.meta")
    new_saver.restore(sess, tf.train.latest_checkpoint('./output/cifar10/'))
In this part we are trying to visualize the maximal activation maps for different input images in each convolutional layer. The same idea as for STL-10 dataset.
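The display functions themselves live in CIFAR10_all.py; as a minimal sketch of the underlying idea (the selection rule here, picking the channel with the largest mean activation per image, is our assumption about what "maximal activation map" means), the maximally activated filters can be found like this:

```python
import numpy as np

def max_activation_indices(h_conv, k=1):
    """For each image in a [N, H, W, C] activation tensor, return the
    indices of the k channels with the largest mean activation."""
    per_channel = h_conv.mean(axis=(1, 2))            # [N, C]
    order = np.argsort(per_channel, axis=1)[:, ::-1]  # sort channels descending
    return order[:, :k]                               # [N, k]

# Toy check: channel 2 is forced to dominate for every image.
acts = np.zeros((4, 8, 8, 5))
acts[..., 2] = 1.0
print(max_activation_indices(acts).ravel())  # -> [2 2 2 2]
```

In the notebook, h_conv would be the evaluated h_conv1 (or h_conv2) tensor for a batch, and the returned channel indices select which activation maps to plot.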
display_max_activations_CONV1_cifar10(input_images=train_images_cifar10, input_labels=train_labels_cifar10, batch_size=100, n1=96)
From the plots shown above we can see that, due to the tiny image size of the CIFAR-10 dataset, it is hard to make out the object even in the raw images. The first convolutional layer has tried to learn some salient patterns from the input images, but with limited success. For some raw images that are comparatively more distinguishable than others, the maximal activation maps have picked out some parts of the objects, such as the bodies and heads of birds, the legs of dogs and the wings of airplanes.
As you can see, activation maps No.1 and No.59 appear for most of the images, which suggests that these filters activate maximally for most of the given pictures, capturing not only the purple and blue colors but also some parts of the objects in the raw images.
display_max_activations_CONV2_cifar10(input_images=train_images_cifar10, input_labels=train_labels_cifar10, batch_size=100, n1=96, n2=96)
From the plots shown above we can see that the second convolutional layer tries to distinguish the outlines of objects from the backgrounds by the presence of distinct color patterns, but it is clearly not very successful. For some specific classes, such as airplane, the comparatively distinct outlines in the raw images make the reconstructions from the second convolutional layer much better.
In this part, we are trying to do saliency visualizations for given images. The same idea as for STL-10 dataset.
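Grad_to_images_cifar10 is defined in CIFAR10_all.py; a minimal sketch of the standard post-processing it presumably applies (this helper and its exact normalization are our assumptions) collapses the [H, W, 3] input-gradient into a single-channel saliency map:

```python
import numpy as np

def saliency_from_gradient(grad):
    """Collapse a [H, W, 3] input-gradient into a [H, W] saliency map:
    maximum absolute gradient over color channels, rescaled to [0, 1]."""
    sal = np.abs(grad).max(axis=-1)
    rng = sal.max() - sal.min()
    return (sal - sal.min()) / rng if rng > 0 else sal

# In the TF 1.x graph, the gradient itself would come from something like:
#   grads = tf.gradients(logits[:, class_idx], x)[0]   # assumed usage
grad = np.random.randn(32, 32, 3)
sal = saliency_from_gradient(grad)
print(sal.shape, float(sal.min()), float(sal.max()))  # (32, 32) 0.0 1.0
```

The same post-processing applies unchanged when the gradient is taken with respect to a higher layer's tensor instead of the logits.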
Grad_to_images_cifar10(layer_name="Add_4:0", input_images=train_images_cifar10, input_labels=train_labels_cifar10, batch_size=100)
From the plots shown above we can see that the outlines of the main objects in the input images are not very clear, partly due to the poor quality of the raw images, but we can still make something out by comparing against the raw images. These saliency maps tell us, to some extent, which pixels matter for the classification.
In this part, we are trying to do back-propagation visualization for given images from a specific layer in the CNN. The same idea as for the STL-10 dataset.
Grad_to_images_cifar10(layer_name="MaxPool_1:0", input_images=train_images_cifar10, input_labels=train_labels_cifar10, batch_size=100)
From the plots shown above we can see that the outlines of the main objects in the input images are not very clear, and the same holds for the raw images. These back-propagation maps show us, to some extent, which pixels matter for the second max-pooling layer in our CNN.
Grad_to_images_cifar10(layer_name="Relu_1:0", input_images=train_images_cifar10, input_labels=train_labels_cifar10, batch_size=100)
From the plots shown above we can see that the outlines of the main objects in the input images are not very clear, due to the poor quality of the input images. These back-propagation maps show us, to some extent, which pixels matter for the second convolutional layer in our CNN.
In this part, we are trying to visualize activation maps for specific layers with specific images. The same idea as for STL-10 dataset.
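plotCNN_actmaps_cifar10 lives in CIFAR10_all.py; a minimal sketch of the tiling step it presumably performs (the helper name and grid layout are our assumptions) arranges one image's activation maps into a single grid so they can be shown with one imshow call:

```python
import numpy as np

def tile_activation_maps(h, cols=8):
    """Arrange a single image's [H, W, C] activation maps into a
    (rows*H) x (cols*W) grid image, row-major by channel index."""
    H, W, C = h.shape
    rows = int(np.ceil(C / cols))
    grid = np.zeros((rows * H, cols * W))
    for c in range(C):
        r, q = divmod(c, cols)
        grid[r*H:(r+1)*H, q*W:(q+1)*W] = h[..., c]
    return grid

# 96 activation maps of size 16x16 (as after the first pool) -> 12x8 grid.
grid = tile_activation_maps(np.random.rand(16, 16, 96), cols=8)
print(grid.shape)  # (192, 128)
```

The resulting grid can be passed to plt.imshow once, instead of plotting 96 separate subplots.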
plotCNN_actmaps_cifar10(layer_name="Relu_1:0", input_images=valid_img, image_idx=50)
From the plots shown above we can see that the activation maps from the second convolutional layer are trying to capture the general outline of the main object in an input image. Most of them are not successful, possibly due to the poor inputs.
Maps No.6, No.14, No.43 and No.58, among others, have comparatively succeeded in capturing the general outline of this input image.
plotCNN_actmaps_cifar10(layer_name="Relu:0", input_images=valid_img, image_idx=50)
From the plots shown above we can see that, for some specific activation maps from the first convolutional layer, the general outlines of this input image are pretty clear, such as in maps No.25, No.38, No.41, No.67 and No.85. This situation is quite interesting and needs further investigation.